We will work with air pollution data from the U.S. Environmental Protection Agency (EPA). The EPA has a national monitoring network of air pollution sites. The primary question you will answer is whether daily concentrations of PM2.5 (particulate matter air pollution with aerodynamic diameter less than 2.5 \(\mu\)m) have decreased in California over the last 20 years (from 2002 to 2022).
A primer on particulate matter air pollution can be found here.
Your assignment should be completed in Quarto or R Markdown.
# Steps
1. Given the formulated question from the assignment description, you will now conduct EDA Checklist items 2-4. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website. Read in the data using data.table(). For each of the two datasets, check the dimensions, headers, footers, variable names and variable types. Check for any data issues, particularly in the key variable we are analyzing. Make sure you write up a summary of all of your findings.
head(data_2022)
Date Source Site ID POC Daily Mean PM2.5 Concentration Units
1 01/01/2022 AQS 60010007 3 12.7 ug/m3 LC
2 01/02/2022 AQS 60010007 3 13.9 ug/m3 LC
3 01/03/2022 AQS 60010007 3 7.1 ug/m3 LC
4 01/04/2022 AQS 60010007 3 3.7 ug/m3 LC
5 01/05/2022 AQS 60010007 3 4.2 ug/m3 LC
6 01/06/2022 AQS 60010007 3 3.8 ug/m3 LC
Daily AQI Value Local Site Name Daily Obs Count Percent Complete
1 58 Livermore 1 100
2 60 Livermore 1 100
3 39 Livermore 1 100
4 21 Livermore 1 100
5 23 Livermore 1 100
6 21 Livermore 1 100
AQS Parameter Code AQS Parameter Description Method Code
1 88101 PM2.5 - Local Conditions 170
2 88101 PM2.5 - Local Conditions 170
3 88101 PM2.5 - Local Conditions 170
4 88101 PM2.5 - Local Conditions 170
5 88101 PM2.5 - Local Conditions 170
6 88101 PM2.5 - Local Conditions 170
Method Description CBSA Code
1 Met One BAM-1020 Mass Monitor w/VSCC 41860
2 Met One BAM-1020 Mass Monitor w/VSCC 41860
3 Met One BAM-1020 Mass Monitor w/VSCC 41860
4 Met One BAM-1020 Mass Monitor w/VSCC 41860
5 Met One BAM-1020 Mass Monitor w/VSCC 41860
6 Met One BAM-1020 Mass Monitor w/VSCC 41860
CBSA Name State FIPS Code State County FIPS Code
1 San Francisco-Oakland-Hayward, CA 6 California 1
2 San Francisco-Oakland-Hayward, CA 6 California 1
3 San Francisco-Oakland-Hayward, CA 6 California 1
4 San Francisco-Oakland-Hayward, CA 6 California 1
5 San Francisco-Oakland-Hayward, CA 6 California 1
6 San Francisco-Oakland-Hayward, CA 6 California 1
County Site Latitude Site Longitude
1 Alameda 37.68753 -121.7842
2 Alameda 37.68753 -121.7842
3 Alameda 37.68753 -121.7842
4 Alameda 37.68753 -121.7842
5 Alameda 37.68753 -121.7842
6 Alameda 37.68753 -121.7842
tail(data_2022)
Date Source Site ID POC Daily Mean PM2.5 Concentration Units
59751 12/01/2022 AQS 61131003 1 3.4 ug/m3 LC
59752 12/07/2022 AQS 61131003 1 3.8 ug/m3 LC
59753 12/13/2022 AQS 61131003 1 6.0 ug/m3 LC
59754 12/19/2022 AQS 61131003 1 34.8 ug/m3 LC
59755 12/25/2022 AQS 61131003 1 23.2 ug/m3 LC
59756 12/31/2022 AQS 61131003 1 1.0 ug/m3 LC
Daily AQI Value Local Site Name Daily Obs Count Percent Complete
59751 19 Woodland-Gibson Road 1 100
59752 21 Woodland-Gibson Road 1 100
59753 33 Woodland-Gibson Road 1 100
59754 99 Woodland-Gibson Road 1 100
59755 77 Woodland-Gibson Road 1 100
59756 6 Woodland-Gibson Road 1 100
AQS Parameter Code AQS Parameter Description Method Code
59751 88101 PM2.5 - Local Conditions 145
59752 88101 PM2.5 - Local Conditions 145
59753 88101 PM2.5 - Local Conditions 145
59754 88101 PM2.5 - Local Conditions 145
59755 88101 PM2.5 - Local Conditions 145
59756 88101 PM2.5 - Local Conditions 145
Method Description CBSA Code
59751 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
59752 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
59753 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
59754 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
59755 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
59756 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
CBSA Name State FIPS Code State
59751 Sacramento--Roseville--Arden-Arcade, CA 6 California
59752 Sacramento--Roseville--Arden-Arcade, CA 6 California
59753 Sacramento--Roseville--Arden-Arcade, CA 6 California
59754 Sacramento--Roseville--Arden-Arcade, CA 6 California
59755 Sacramento--Roseville--Arden-Arcade, CA 6 California
59756 Sacramento--Roseville--Arden-Arcade, CA 6 California
County FIPS Code County Site Latitude Site Longitude
59751 113 Yolo 38.66121 -121.7327
59752 113 Yolo 38.66121 -121.7327
59753 113 Yolo 38.66121 -121.7327
59754 113 Yolo 38.66121 -121.7327
59755 113 Yolo 38.66121 -121.7327
59756 113 Yolo 38.66121 -121.7327
str(data_2022)
'data.frame': 59756 obs. of 22 variables:
$ Date : chr "01/01/2022" "01/02/2022" "01/03/2022" "01/04/2022" ...
$ Source : chr "AQS" "AQS" "AQS" "AQS" ...
$ Site ID : int 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
$ POC : int 3 3 3 3 3 3 3 3 3 3 ...
$ Daily Mean PM2.5 Concentration: num 12.7 13.9 7.1 3.7 4.2 3.8 2.3 6.9 13.6 11.2 ...
$ Units : chr "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
$ Daily AQI Value : int 58 60 39 21 23 21 13 38 59 55 ...
$ Local Site Name : chr "Livermore" "Livermore" "Livermore" "Livermore" ...
$ Daily Obs Count : int 1 1 1 1 1 1 1 1 1 1 ...
$ Percent Complete : num 100 100 100 100 100 100 100 100 100 100 ...
$ AQS Parameter Code : int 88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
$ AQS Parameter Description : chr "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
$ Method Code : int 170 170 170 170 170 170 170 170 170 170 ...
$ Method Description : chr "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" ...
$ CBSA Code : int 41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
$ CBSA Name : chr "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
$ State FIPS Code : int 6 6 6 6 6 6 6 6 6 6 ...
$ State : chr "California" "California" "California" "California" ...
$ County FIPS Code : int 1 1 1 1 1 1 1 1 1 1 ...
$ County : chr "Alameda" "Alameda" "Alameda" "Alameda" ...
$ Site Latitude : num 37.7 37.7 37.7 37.7 37.7 ...
$ Site Longitude : num -122 -122 -122 -122 -122 ...
sum(is.na(data_2022$"Daily Mean PM2.5 Concentration"))
[1] 0
summary(data_2022[,5])
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.700 4.100 6.800 8.428 10.700 302.500
sum(data_2022[,5] <0)
[1] 215
The 2002 data has 15,976 rows of observations, while the 2022 data has 59,756 rows. Both datasets have 22 columns with matching variable names and data types. Neither dataset has missing values for the key variable, the daily mean PM2.5 concentration.
The 2002 daily PM2.5 concentrations raised no concerns, with a minimum of 0 and a maximum of 104.3. In the 2022 data, 215 measurements are negative, with a minimum of -6.7. Because a concentration cannot be negative, these readings are likely measurement error.
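The checks described above can be sketched on a small example. The data frame `toy` and all of its values are invented for illustration; only the column name mirrors the EPA file. Because `nrow()` returns `NULL` when given a vector, the sketch counts missing values with `sum(is.na(...))` instead.

```r
# Toy stand-in for one year of EPA data (values are invented for illustration)
toy <- data.frame(
  Date = c("01/01/2022", "01/02/2022", "01/03/2022", "01/04/2022", "01/05/2022"),
  `Daily Mean PM2.5 Concentration` = c(12.7, -1.5, NA, 3.7, 302.5),
  check.names = FALSE  # keep the spaces in the column name, as in the EPA file
)

pm <- toy[["Daily Mean PM2.5 Concentration"]]

n_missing     <- sum(is.na(pm))               # number of missing readings
n_negative    <- sum(pm < 0, na.rm = TRUE)    # number of physically implausible readings
prop_negative <- n_negative / sum(!is.na(pm)) # share of non-missing readings below zero

print(dim(toy))
print(c(missing = n_missing, negative = n_negative))
print(prop_negative)
```

Applied to the real counts reported above, 215 negative readings out of 59,756 rows is about 0.36% of the 2022 observations.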
2. Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.
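One possible way to carry out this step in base R is sketched below. The helper `make_toy()` and its values are hypothetical stand-ins for the two downloaded files; the short names `PM`, `lat`, and `lon`, and the columns `year` and `month`, are chosen to match the names used in the code later in this report.

```r
# Hypothetical helper that builds a tiny data frame shaped like the EPA files
make_toy <- function(dates, pm) {
  data.frame(
    Date = dates,
    `Daily Mean PM2.5 Concentration` = pm,
    `Site Latitude` = 34.07,
    `Site Longitude` = -118.23,
    check.names = FALSE
  )
}

data_2002 <- make_toy(c("01/05/2002", "07/14/2002"), c(25.1, 14.0))
data_2022 <- make_toy(c("01/05/2022", "07/14/2022"), c(9.8, 6.2))

# Stack the two years; this works because the column names match
mergedData <- rbind(data_2002, data_2022)

# Parse the month/day/year strings, then derive the year and month identifiers
mergedData$Date  <- as.Date(mergedData$Date, format = "%m/%d/%Y")
mergedData$year  <- factor(format(mergedData$Date, "%Y"))
mergedData$month <- as.integer(format(mergedData$Date, "%m"))

# Replace the long names with short ones
old <- c("Daily Mean PM2.5 Concentration", "Site Latitude", "Site Longitude")
new <- c("PM", "lat", "lon")
names(mergedData)[match(old, names(mergedData))] <- new

str(mergedData)
```

Storing `year` as a factor means plotting functions treat it as a discrete group rather than a continuous number.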
3. Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.
library(leaflet)
loc.pal <- colorFactor(c('darkgreen', 'goldenrod'), domain = mergedData$year)
# map includes both years
sitemap <- leaflet(mergedData) |>
  addProviderTiles('CartoDB.Positron') |>
  addCircles(
    lat = ~lat, lng = ~lon,
    label = ~paste0(year), color = ~loc.pal(year),
    opacity = 1, fillOpacity = .5, radius = 500
  ) |>
  setView(lng = mean(mergedData$lon, na.rm = TRUE), lat = mean(mergedData$lat, na.rm = TRUE), zoom = 5) |>
  addLegend('bottomleft', pal = loc.pal, values = mergedData$year, title = 'Year', opacity = 1)
sitemap
# map for 2002
map2002 <- leaflet(data_2002) |>
  addProviderTiles('CartoDB.Positron') |>
  # Some circles
  addCircles(
    lat = ~lat, lng = ~lon,
    # HERE IS OUR PAL!
    label = ~paste0(year), color = ~loc.pal(year),
    opacity = 1, fillOpacity = .5, radius = 500
  ) |>
  setView(lng = mean(data_2002$lon, na.rm = TRUE), lat = mean(data_2002$lat, na.rm = TRUE), zoom = 5) |>
  # And a pretty legend
  addLegend('bottomleft', pal = loc.pal, values = data_2002$year, title = 'Year', opacity = 1)
map2002
# map for 2022
map2022 <- leaflet(data_2022) |>
  addProviderTiles('CartoDB.Positron') |>
  # Some circles
  addCircles(
    lat = ~lat, lng = ~lon,
    # HERE IS OUR PAL!
    label = ~paste0(year), color = ~loc.pal(year),
    opacity = 1, fillOpacity = .5, radius = 500
  ) |>
  # center the view on the 2022 sites (the original centered on data_2002)
  setView(lng = mean(data_2022$lon, na.rm = TRUE), lat = mean(data_2022$lat, na.rm = TRUE), zoom = 5) |>
  # And a pretty legend
  addLegend('bottomleft', pal = loc.pal, values = data_2022$year, title = 'Year', opacity = 1)
map2022
Most of the 2002 locations appear to still be in use in 2022, but 2022 has many more monitoring sites. Sites cluster along the coast, with the majority near Los Angeles and San Francisco. The fewest sites are found toward the middle and east of the state.
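The visual impression from the maps can be backed up with counts. The sketch below uses a made-up table `obs` with hypothetical site IDs; with the real data, the `site` column would be the EPA Site ID.

```r
# Toy table with one row per daily observation (site IDs invented for illustration)
obs <- data.frame(
  site = c(101, 101, 102, 101, 102, 103, 104),
  year = c(2002, 2002, 2002, 2022, 2022, 2022, 2022)
)

# Keep one row per site-year so repeated daily readings are not double counted
sites <- unique(obs[, c("site", "year")])

# Number of distinct sites in each year
sites_per_year <- table(sites$year)

sites_2002 <- sites$site[sites$year == 2002]
sites_2022 <- sites$site[sites$year == 2022]

both_years  <- intersect(sites_2002, sites_2022)  # sites reporting in both years
new_in_2022 <- setdiff(sites_2022, sites_2002)    # sites that appear only in 2022

print(sites_per_year)
print(both_years)
print(new_in_2022)
```

Passing a de-duplicated site table like `sites` (with coordinates added) to the map would also avoid drawing one overlapping circle per daily reading.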
4. Check for any missing or implausible values of PM2.5 in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.
hist(mergedData$PM)
boxplot(mergedData$PM)
summary(mergedData$PM)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.70 4.50 7.60 10.05 12.20 302.50
highvalues %>%
  ggplot(mapping = aes(x = month, y = PM, fill = factor(year))) +
  geom_bar(position = "dodge", stat = "identity")
The histogram and box plot of daily averages show that the majority of values lie between 0 and 25. As mentioned earlier, 215 values fall below 0 and should not be there, with the highest value being in the month of November. There are also many outliers that are cause for concern but are not necessarily impossible values. Of the 144 values above 75 ug/m3 LC, the majority were recorded in 2002, where the maximum is 104.3. The highest values in the dataset, however, are all from 2022 and occur from June to November.
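The proportions and temporal patterns discussed above could be tabulated along the following lines. The data frame `pm_toy` is invented for illustration, and the cutoff of 75 is taken from the text; it is an assumption that `highvalues` was built with the same threshold.

```r
# Toy readings with a year and month label (values invented for illustration)
pm_toy <- data.frame(
  year  = c(2002, 2002, 2002, 2002, 2022, 2022, 2022, 2022),
  month = c(1, 6, 11, 12, 1, 8, 9, 11),
  PM    = c(20.4, 80.2, 104.3, 15.0, -1.2, 150.0, 302.5, -0.4)
)

is_negative <- pm_toy$PM < 0
is_high     <- pm_toy$PM > 75   # threshold assumed from the write-up

# Share of flagged readings within each year
prop_negative <- tapply(is_negative, pm_toy$year, mean)
prop_high     <- tapply(is_high, pm_toy$year, mean)

# Which months the flagged readings fall in
high_by_month <- table(year = pm_toy$year[is_high], month = pm_toy$month[is_high])
neg_months    <- sort(unique(pm_toy$month[is_negative]))

print(prop_negative)
print(prop_high)
print(high_by_month)
print(neg_months)
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which is why `tapply(..., mean)` returns a share per year.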
5. Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.
- state
library(ggplot2)
mergedData |>
  ggplot(mapping = aes(x = State, y = PM, fill = year)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_fill_brewer(palette = "Accent")
mergedData |>
  ggplot(mapping = aes(x = State, y = PM, fill = year)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Accent")
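mergedData |>
  ggplot(mapping = aes(x = State, y = PM, color = year)) +
  # alpha is a fixed setting, so it belongs outside aes()
  geom_point(position = "jitter", alpha = 0.5) +
  # the points are mapped to color, so use the color scale rather than the fill scale
  scale_color_brewer(palette = "Accent") +
  geom_smooth(method = lm, se = FALSE, col = "black") +
  theme(axis.text.x = element_text(angle = 90))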
`geom_smooth()` using formula = 'y ~ x'
Looking at just the bar chart, the maximum PM2.5 level rose from 104.3 in 2002 to 302.5 in 2022. However, the other graphs show that the median was much higher in 2002 than in 2022, with the bulk of the 2022 values at almost half the level of the 2002 values. At the same time, 2022 has far more extreme outliers than 2002, exceeding the 2002 maximum by almost 200 ug/m3 LC.
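The contrast between the maximum and the median can be made concrete with per-year summary statistics. The sketch below uses invented values in `state_toy`, chosen only to show how a single extreme reading raises the mean and maximum while leaving the median low; the numbers are not the real statewide statistics.

```r
# Toy statewide readings (values invented; the two maxima echo the write-up)
state_toy <- data.frame(
  year = factor(c(2002, 2002, 2002, 2002, 2022, 2022, 2022, 2022)),
  PM   = c(10, 16, 22, 104.3, 4, 7, 9, 302.5)
)

# Median, mean, and maximum for each year (statistics in rows, years in columns)
stats <- sapply(
  split(state_toy$PM, state_toy$year),
  function(x) c(median = median(x), mean = mean(x), max = max(x))
)

print(stats)
```

Because the median ignores the size of extreme values, it is the more robust summary for judging whether typical daily concentrations decreased.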
- county
library(ggplot2)
library(RColorBrewer)
mergedData |>
  ggplot(mapping = aes(x = County, y = PM, fill = year)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_fill_brewer(palette = "Pastel2") +
  theme(axis.text.x = element_text(angle = 90))
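mergedData |>
  ggplot(mapping = aes(x = County, y = PM, color = year)) +
  # alpha is a fixed setting, so it belongs outside aes()
  geom_point(position = "jitter", alpha = 0.5) +
  # the points are mapped to color, so use the color scale rather than the fill scale
  scale_color_brewer(palette = "Pastel2") +
  theme(axis.text.x = element_text(angle = 90)) +
  facet_wrap(~ year, nrow = 2)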
which(highvalues$PM>295 )
[1] 143 144
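As a numeric companion to the county plots, a county-by-year table of medians makes the direction of change explicit. This is a sketch on invented values; `county_toy` is a hypothetical stand-in for the combined data, and the county names are used only as labels.

```r
# Toy county readings (values invented for illustration)
county_toy <- data.frame(
  County = c("Kern", "Kern", "Kern", "Kern", "Yolo", "Yolo", "Yolo", "Yolo"),
  year   = c(2002, 2002, 2022, 2022, 2002, 2002, 2022, 2022),
  PM     = c(30, 20, 12, 10, 14, 10, 6, 8)
)

# Median PM2.5 for every county-year pair (counties in rows, years in columns)
county_median <- tapply(
  county_toy$PM,
  list(county_toy$County, county_toy$year),
  median
)

# Change from 2002 to 2022; negative numbers indicate a decrease
change <- county_median[, "2022"] - county_median[, "2002"]

print(county_median)
print(change)
```

Counties with no monitor in one of the years would show `NA` in this table, which is worth checking before comparing years.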
Looking at the counties, we can see that the 2002 data was much more evenly distributed than the 2022 data. The highest value for 2002 was found in Kern County. In 2022, the majority of counties show consistently lower values than in 2002, but the variance in the outliers is much higher, with the highest value in Siskiyou County, closely followed by Placer County.
- site in Los Angeles
library(dplyr)  # provides %>% and filter()
lacounty <- mergedData %>%
  filter(County == "Los Angeles")
lacounty |>
  ggplot(mapping = aes(x = year, y = PM, fill = year)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_fill_brewer(palette = "Pastel1")
lacounty |>
  ggplot(mapping = aes(x = year, y = PM, fill = year)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel1")
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.20 7.90 11.40 13.32 16.00 72.40
summary(lacounty[lacounty$year ==2002, "PM"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.60 11.10 17.40 19.66 25.50 72.40
summary(lacounty[lacounty$year ==2022, "PM"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.20 7.40 10.30 10.97 13.70 56.00
The data for LA County show that PM2.5 concentrations were highest in 2002, with a median of 17.4 ug/m3 LC and a maximum of 72.4 ug/m3 LC. Concentrations were substantially lower in 2022, with a median of 10.3 ug/m3 LC and a maximum of 56.0 ug/m3 LC. The 2022 concentrations are mostly below 20 ug/m3 LC (the third quartile is 13.7) with few outliers, while the 2002 concentrations are spread over a wider range of values.
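The summaries above pool all Los Angeles County monitors. To look at individual sites, one option is to restrict the comparison to sites that report in both years, so that a change in the set of monitors is not mistaken for a change in air quality. The sketch below uses invented values, and the site labels `A`, `B`, and `C` are hypothetical.

```r
# Toy Los Angeles readings by site (labels and values invented for illustration)
la_toy <- data.frame(
  site = c("A", "A", "A", "A", "B", "B", "C", "C"),
  year = c(2002, 2002, 2022, 2022, 2002, 2002, 2022, 2022),
  PM   = c(18, 22, 9, 11, 25, 15, 8, 12)
)

# Sites that have readings in both 2002 and 2022
in_both <- intersect(
  la_toy$site[la_toy$year == 2002],
  la_toy$site[la_toy$year == 2022]
)
paired <- la_toy[la_toy$site %in% in_both, ]

# Median PM2.5 per site and year, for the paired sites only
site_median <- tapply(paired$PM, list(paired$site, paired$year), median)

print(in_both)
print(site_median)
```

In this toy example only site `A` qualifies, because `B` has no 2022 readings and `C` has no 2002 readings.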